An Introduction to Julia Programming for Data Analysis

Why Julia?

Julia is an open-source programming language that was developed at MIT and first released publicly in 2012. Julia combines many advantages of the data science stack into one programming language:

- Julia has statistical capabilities like R.

- Julia is as easy to pick up as a general-purpose programming language like Python.

- Julia can call 33 (and counting!) data visualization packages in Python, R, and Matplotlib.

- Julia can handle scientific computing like Mathematica.

- Julia has built-in support for parallel computing, which can make it faster than Python for many workloads.

- Julia can call Python, C, and Fortran libraries.

- Julia is just-in-time (JIT) compiled, meaning code is compiled at run time, which often makes it faster than interpreted Python.

All of these advantages mean Julia has the potential to serve as a "one stop shop" for data scientists looking to test machine learning models and then deploy them into production.

Do I think Julia is ready for the "big time" yet? No, not quite. It is still quite hard to use Julia for the three steps of testing machine learning models: 1) build; 2) train; and 3) validate. Julia's machine learning ecosystem is not fully developed; however, the potential is there. Remember that Julia only reached its official 1.0 release in 2018, so it has only recently left the "beta" stage.

Julia is a language to put on your "to watch" list, since it could become the tool of choice in the near future.

Here are a few resources to learn more about the advantages of Julia:

Bezanson, J., Karpinski, S., Shah, V., and Edelman, A. (2012). Why we created Julia. Julia. Retrieved from https://julialang.org/blog/2012/02/why-we-created-julia/.

Yegulalp, S. (2020). Julia vs. Python: Which is best for data science? InfoWorld. Retrieved from https://www.infoworld.com/article/3241107/julia-vs-python-which-is-best-for-data-science.html.

Julia (2020). Julia. Retrieved from https://julialang.org/.

Downloading Julia

You will need to download the current stable release of Julia from here: https://julialang.org/downloads/. I am running the 64-bit Windows version.

Connecting Julia to Jupyter Notebook

I am using Jupyter Notebook via Anaconda.

From the Julia Command Prompt, type in the following:

using Pkg
Pkg.add("IJulia")

More Packages to Install

You will need to install the following packages from the Julia Command Prompt:

Pkg.add("DataFrames") # tabular data structures, similar to pandas in Python
Pkg.add("CSV") # reads and writes CSV files
Pkg.add("StatsBase") # basic statistical analysis
Pkg.add("Plots") # a common interface to plotting backends such as GR, Plotly, and PyPlot; Julia does not ship with a built-in plotting package
Pkg.add("StatsPlots") # extends Plots with statistical recipes and data frame support (formerly named StatPlots)
Pkg.add("MLDataUtils") # data preprocessing utilities for machine learning
Pkg.add("ScikitLearn") # a Julia interface to the popular scikit-learn machine learning library from Python

The Obligatory "Hello World"
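
A minimal sketch of a Julia "Hello World"; the variable name `greeting` is only illustrative:

```julia
# println writes its argument to standard output followed by a newline.
greeting = "Hello World"
println(greeting)
```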

Checking and Setting Working Directory
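
A short sketch of checking and changing the working directory with Julia's built-in `pwd()` and `cd()`. The use of `homedir()` is an assumption standing in for wherever your data files actually live:

```julia
# pwd() returns the current working directory as a string.
original_dir = pwd()
println("Current directory: ", original_dir)

# cd() changes the working directory. homedir() is only a placeholder
# here; substitute the folder that holds your data.
cd(homedir())
println("Now in: ", pwd())

# Switch back so that later cells are unaffected.
cd(original_dir)
```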

Read in DataFrame
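
One way to read a CSV file into a data frame is sketched below. The file name, column names, and values are made up for illustration, and the two-argument form `CSV.read(path, DataFrame)` assumes a reasonably recent version of CSV.jl:

```julia
using CSV
using DataFrames

# Write a tiny, made-up CSV to a temporary directory so the example
# runs anywhere; in practice you would point at your own file.
csv_path = joinpath(mktempdir(), "example.csv")
write(csv_path, """
name,age,city
Ann,34,Boston
Bob,28,Denver
Cy,45,Austin
""")

# CSV.read parses the file and materializes it as a DataFrame.
df = CSV.read(csv_path, DataFrame)
println(df)
```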

Now let's check some basic attributes.
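
A sketch of a few attribute checks, using a small hypothetical data frame (the columns `name`, `age`, and `city` are invented for this example):

```julia
using DataFrames

# Hypothetical sample data, not taken from the original dataset.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
    city = ["Boston", "Denver", "Austin", "Miami", "Reno", "Tulsa"],
)

println(size(df))               # (number of rows, number of columns)
println(nrow(df))               # number of rows
println(ncol(df))               # number of columns
println(names(df))              # column names as strings
println(eltype.(eachcol(df)))   # element type of each column
```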

Julia uses 1-based indexing (instead of Python's 0-based indexing). Here we print rows 1 through 5.
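
Selecting rows 1 through 5 might look like the sketch below; the data frame is a hypothetical stand-in for the real data:

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
)

# Rows 1 through 5, all columns. The range 1:5 includes both endpoints.
first_five = df[1:5, :]
println(first_five)

# first(df, 5) is an equivalent shortcut.
println(first(df, 5))
```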

And here we are printing out the first two columns of rows 1 through 5.
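
Restricting the selection to the first two columns can be sketched by adding a column range as the second index (again on made-up data):

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
    city = ["Boston", "Denver", "Austin", "Miami", "Reno", "Tulsa"],
)

# Rows 1 through 5, columns 1 through 2 (selected by position).
subset_by_position = df[1:5, 1:2]
println(subset_by_position)
```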

This is another way to reference columns.
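
As a sketch of one such alternative, columns can be referenced by name rather than by position. The column names here are hypothetical:

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
    city = ["Boston", "Denver", "Austin", "Miami", "Reno", "Tulsa"],
)

# Select columns by name using a vector of symbols.
subset_by_name = df[1:5, [:name, :age]]
println(subset_by_name)

# A single column can be accessed with dot syntax; this returns a vector.
ages = df.age
println(ages)
```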

We can use the same subsetting approach from R to print a subset.
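
R-style subsetting with a logical condition might be sketched as follows. The threshold of 30 and the `age` column are assumptions chosen for illustration:

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
)

# df.age .> 30 broadcasts the comparison over the column and yields a
# Boolean mask; indexing with the mask keeps only the matching rows.
over_30 = df[df.age .> 30, :]
println(over_30)
```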

Or we can use the filter() function.
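
A sketch of the same kind of row selection using `filter`, which takes a predicate first and the data frame second. The data and the threshold are hypothetical:

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
)

# The predicate receives one row at a time and returns true to keep it.
over_30 = filter(row -> row.age > 30, df)
println(over_30)

# The column => predicate form avoids building a full row object and is
# usually faster on large data frames.
over_30_alt = filter(:age => >(30), df)
```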

You can use the show() function to print out all columns.
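
By default, wide data frames are truncated to fit the display. A sketch of forcing every column to print with the `allcols` keyword, on hypothetical data:

```julia
using DataFrames

# Hypothetical sample data.
df = DataFrame(
    name = ["Ann", "Bob", "Cy"],
    age  = [34, 28, 45],
    city = ["Boston", "Denver", "Austin"],
)

# allcols = true prints every column even if the table is wider than the
# display; allrows = true does the same for rows.
show(df, allcols = true)
```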

You can do even more with the describe() function. In particular, you can ask for the following statistical properties:
mean
std
min
q25
median
q75
max
eltype
nunique
first
last
nmissing
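
The statistics listed above can be requested by passing them to `describe()` as symbols. The sketch below assumes a recent DataFrames.jl release and uses made-up data:

```julia
using DataFrames

# Hypothetical sample data with one text column and one numeric column.
df = DataFrame(
    name = ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    age  = [34, 28, 45, 39, 51, 23],
)

# Default summary of every column.
println(describe(df))

# Request only specific statistics. Statistics that do not apply to a
# column (such as the mean of a text column) are reported as `nothing`.
summary_stats = describe(df, :mean, :std, :min, :median, :max, :nmissing)
println(summary_stats)
```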

Basic Plots
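
A minimal sketch of two basic plots with the Plots package. The data, axis labels, and output file name are all invented for illustration:

```julia
using Plots

# Hypothetical sample data.
ages   = [34, 28, 45, 39, 51, 23]
scores = [72, 65, 88, 80, 91, 60]

# A histogram of a single variable.
hist = histogram(ages, bins = 5, xlabel = "age", ylabel = "count", legend = false)

# A scatter plot of two variables.
scat = scatter(ages, scores, xlabel = "age", ylabel = "score", legend = false)

# Combine both panels side by side and save the figure to a temporary folder.
fig = plot(hist, scat, layout = (1, 2))
savefig(fig, joinpath(mktempdir(), "basic_plots.png"))
```

In a Jupyter notebook, ending a cell with `fig` displays the figure inline.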

Machine Learning with Scikit-Learn
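
A hedged sketch of the build, train, and validate steps using ScikitLearn.jl. It assumes a working Python scikit-learn installation behind the package, uses randomly generated data rather than a real dataset, and picks logistic regression and an 80/20 split purely as examples:

```julia
using Random
using ScikitLearn

# Import a model class from Python's scikit-learn.
@sk_import linear_model: LogisticRegression

# Synthetic data: two features, with a label that depends on their sum.
Random.seed!(42)
n = 100
X = randn(n, 2)
y = Int.(X[:, 1] .+ X[:, 2] .> 0)

# Manual 80/20 train/test split on shuffled row indices.
idx = shuffle(1:n)
train_idx, test_idx = idx[1:80], idx[81:end]

# 1) Build the model.
model = LogisticRegression()

# 2) Train on the training rows.
fit!(model, X[train_idx, :], y[train_idx])

# 3) Validate on the held-out rows.
predictions = predict(model, X[test_idx, :])
accuracy = sum(predictions .== y[test_idx]) / length(test_idx)
println("Held-out accuracy: ", accuracy)
```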

References

Breloff, T. (2020). StatsPlots Documentation. GitHub. Retrieved from https://github.com/JuliaPlots/StatsPlots.jl.

Bezanson, J., Karpinski, S., Shah, V., and Edelman, A. (2012). Why we created Julia. Julia. Retrieved from https://julialang.org/blog/2012/02/why-we-created-julia/.

England, A. (2018). Tutorial: Tuning and fitting machine learning models with Julia. LinkedIn. Retrieved from https://www.linkedin.com/pulse/tutorial-tuning-fitting-machine-learning-models-julia-england-ph-d/.

Julia (2020). Julia. Retrieved from https://julialang.org/.

microgold (2018). Simple tool for train test split. Julia Discourse. Retrieved from https://discourse.julialang.org/t/simple-tool-for-train-test-split/473.

Yegulalp, S. (2020). Julia vs. Python: Which is best for data science? InfoWorld. Retrieved from https://www.infoworld.com/article/3241107/julia-vs-python-which-is-best-for-data-science.html.